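[Tutorial] Deploying an Enterprise-Grade Highly Available K8s Cluster with Ansible
This is a development document aimed at developers and AI. It is reposted from my documentation site; original URL:
The development environment for this article is Linux, with the micro CLI used to edit files. Adjust the commands to your own system.
Basic concepts
About Ansible
Ansible is an agentless automation tool that expresses configuration and changes as clear, repeatable tasks. It is good at applying consistent configuration across many hosts, and it also suits application deployment and batch operations. Paired with a load balancer, it can break a complex change into controlled rolling steps.
Ansible is also a very good fit for deploying and managing HAProxy.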

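About Kubernetes and RKE2
Kubernetes (K8s) is a container orchestration system that provides scheduling, service discovery, rolling updates, and self-healing. Its goal is to standardize how distributed applications run so that operations become more predictable.
RKE2 (also known as RKE Government) is Rancher's Kubernetes distribution. It is conformance-certified, leans toward security and compliance by default, and suits production environments.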

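About Rocky Linux and SELinux
Rocky Linux is an open-source enterprise operating system that aims for bug-for-bug compatibility with RHEL. Its stable lifecycle suits long-running production clusters.

SELinux is a mandatory access control (MAC) mechanism that tightly restricts what processes may access. Rocky Linux enables it by default in enforcing mode; configure policy for it rather than turning it off.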

Getting started
Installing Ansible
Install Ansible (using yay as an example):
yay -S ansible
Run ansible --version to check the version.
yun@yun ~/V/a/yunzaixi-dev (main)> ansible --version
ansible [core 2.20.0]
config file = None
configured module search path = ['/home/yun/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python3.13/site-packages/ansible
ansible collection location = /home/yun/.ansible/collections:/usr/share/ansible/collections
executable location = /usr/bin/ansible
python version = 3.13.7 (main, Aug 16 2025, 15:55:01) [GCC 15.2.1 20250813] (/usr/bin/python)
jinja version = 3.1.6
pyyaml version = 6.0.3 (with libyaml v0.2.5)
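Ansible is implemented in Python, so make sure a Python environment is set up before installing it.
lablabs.rke2 depends on the netaddr Python package, which must be installed separately. On Arch Linux: sudo pacman -S python-netaddr.
Installing version control tools
Install git and gh (using yay as an example):
yay -S git github-cli
Run git version and gh version to check the versions.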
yun@yun ~/V/a/yunzaixi-dev (main)> git version
git version 2.52.0
yun@yun ~/V/a/yunzaixi-dev (main)> gh version
gh version 2.83.1 (2025-11-13)
https://github.com/cli/cli/releases/tag/v2.83.1
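Log in to GitHub:
gh auth login --scopes workflow
Follow the prompts.
Preparing cloud servers
Before anything else, prepare the cloud servers for the cluster. A minimal production-grade HA setup (control plane plus etcd) is usually three rke2-server nodes with embedded etcd plus at least one rke2-agent, so you need at least four cloud servers for the steps that follow. The examples in this article use three servers and two agents.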
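The three-server minimum follows from etcd's majority quorum: a cluster of n members needs floor(n/2)+1 of them available, so it tolerates floor((n-1)/2) failures. A minimal sketch of that arithmetic (the function names are illustrative and not part of RKE2 or etcd):

```shell
#!/bin/sh
# Majority quorum for an etcd cluster with $1 members.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while the cluster keeps quorum.
etcd_fault_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

for n in 1 2 3 5; do
  echo "members=$n quorum=$(etcd_quorum "$n") tolerated_failures=$(etcd_fault_tolerance "$n")"
done
```

Two servers tolerate no failure at all, which is why the control plane goes straight from one node to three.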
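To simplify operations, every machine runs Rocky Linux.
Why Rocky Linux: it is a free, open-source enterprise operating system that aims for bug-for-bug compatibility with RHEL, and it is listed in the RKE2 support matrix.
RKE2 is very lightweight but has some minimum requirements:
- No two RKE2 nodes may share a node name. By default the node name is taken from the machine's hostname, so the Linux cloud servers must have distinct hostnames.
- Each cloud server should have at least 2 CPU cores and 4 GB of RAM, and use an SSD for storage.
- The required firewall ports must be open.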
Configuring SSH Config
Add the following to your SSH config (set HostName to each cloud server's public IP address):
Host rke2-server1
  HostName <your-public-IP-1>
  User root
Host rke2-server2
  HostName <your-public-IP-2>
  User root
Host rke2-server3
  HostName <your-public-IP-3>
  User root
Host rke2-agent1
  HostName <your-public-IP-4>
  User root
Host rke2-agent2
  HostName <your-public-IP-5>
  User root
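The configuration above gives every cloud server an SSH alias, which greatly simplifies later operations. Next, upload your SSH public key to each target server: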
ssh-copy-id rke2-server1
ssh-copy-id rke2-server2
ssh-copy-id rke2-server3
ssh-copy-id rke2-agent1
ssh-copy-id rke2-agent2
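The same per-host commands recur throughout this section, so a small loop helper can replace the repeated lines. A minimal sketch; for_each_host is a hypothetical helper name, and the host list simply mirrors the aliases defined in the SSH config:

```shell
#!/bin/sh
# Host aliases from the SSH config in this tutorial.
HOSTS="rke2-server1 rke2-server2 rke2-server3 rke2-agent1 rke2-agent2"

# Run the given command once per host, appending the alias as the last
# argument. Stops at the first failure.
for_each_host() {
  for host in $HOSTS; do
    "$@" "$host" || return 1
  done
}

# Dry run: print the commands instead of executing them.
for_each_host echo ssh-copy-id
```

Dropping echo runs the commands for real, for example for_each_host ssh-copy-id or for_each_host ssh-keygen -R.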
If a server has been reinstalled before, you may need to clear its stale SSH host key first:
ssh-keygen -R rke2-server1
ssh-keygen -R rke2-server2
ssh-keygen -R rke2-server3
ssh-keygen -R rke2-agent1
ssh-keygen -R rke2-agent2
Follow the prompts.
Once done, you can log in to every cloud server without a password:
ssh rke2-server1
ssh rke2-server2
ssh rke2-server3
ssh rke2-agent1
ssh rke2-agent2
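After login, a banner warns that the connection is not using a post-quantum key exchange algorithm and may be attacked in the future (very forward-looking). It can be ignored for this tutorial: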
** WARNING: connection is not using a post-quantum key exchange algorithm.
** This session may be vulnerable to "store now, decrypt later" attacks.
** The server may need to be upgraded. See https://openssh.com/pq.html
Last failed login: ~~ from ~~ on ssh:notty There were 31 failed login attempts since the last successful login.
Initializing the Ansible project
Initializing the repository
First create a directory; the project name is assumed to be rke2-ansible:
yun@yun ~/V/a/y/p/ansible (main)> mkdir rke2-ansible
yun@yun ~/V/a/y/p/ansible (main)> ls
rke2-ansible/
Enter the project directory, initialize git, and create a private GitHub repository:
cd rke2-ansible
git init
echo "# rke2-ansible" > README.md
git add .
git commit -m "chore: initial commit"
gh repo create rke2-ansible --private --source=. --remote=origin --push
The following step is optional. It re-adds the new repository as a submodule of the parent repository:
cd ..
rm -rf rke2-ansible/
git submodule add https://github.com/yunzaixi-dev/rke2-ansible.git ./rke2-ansible
Planning the directory layout
Next, lay out the project structure:
mkdir -p inventories/prod \
group_vars \
host_vars \
playbooks \
roles
Create empty files:
touch ansible.cfg \
requirements.yml \
inventories/prod/hosts.yml \
group_vars/all.yml \
group_vars/rke2_servers.yml \
group_vars/rke2_agents.yml \
host_vars/rke2-server1.yml \
playbooks/site.yml \
playbooks/ping.yml \
playbooks/update-packages.yml \
playbooks/set-hostname.yml \
playbooks/disable-ssh-password.yml
The resulting directory layout:
yun@yun ~/V/a/y/p/a/rke2-ansible (master)> tree
.
├── ansible.cfg
├── group_vars
│   ├── all.yml
│   ├── rke2_agents.yml
│   └── rke2_servers.yml
├── host_vars
│   └── rke2-server1.yml
├── inventories
│   └── prod
│       └── hosts.yml
├── playbooks
│   ├── disable-ssh-password.yml
│   ├── ping.yml
│   ├── site.yml
│   ├── update-packages.yml
│   └── set-hostname.yml
├── README.md
├── requirements.yml
└── roles
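What each directory and file is for:
- ansible.cfg: global Ansible configuration; sets the inventory and roles_path.
- requirements.yml: Galaxy dependency list, used to install the lablabs.rke2 role.
- inventories/prod/hosts.yml: production inventory and host groups.
- group_vars/*.yml: group variables, for cluster-wide parameters and for servers/agents respectively.
- host_vars/rke2-server1.yml: per-host variables, used to declare the first control plane node for initialization.
- playbooks/site.yml: deployment entry point, covering system preparation and the RKE2 installation.
- playbooks/ping.yml: connectivity check playbook, used to verify that hosts are reachable.
- playbooks/update-packages.yml: bulk update playbook, used to upgrade system packages.
- playbooks/set-hostname.yml: sets hostnames in bulk, keeping - and stripping invalid characters.
- playbooks/disable-ssh-password.yml: disables SSH password login and allows key-based login only.
- roles/: directory for roles downloaded from Galaxy.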
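Installing the Galaxy role
micro requirements.yml :
roles:
  - name: lablabs.rke2
    version: "1.49.1"
lablabs.rke2 is a community-maintained RKE2 role (GitHub repository: https://github.com/lablabs/ansible-role-rke2) that wraps the official install script and the service management logic. Pinning it to 1.49.1 keeps deployments reproducible and reduces the uncertainty introduced by upstream updates.
Install the dependency: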
ansible-galaxy role install -r requirements.yml -p roles
yun@yun ~/V/a/y/p/a/rke2-ansible (master)> ansible-galaxy role install -r requirements.yml -p roles
Starting galaxy role install process
- downloading role 'rke2', owned by lablabs
- downloading role from https://github.com/lablabs/ansible-role-rke2/archive/1.49.1.tar.gz
- extracting lablabs.rke2 to /home/yun/Vaults/admin/yunzaixi-dev/project/ansible/rke2-ansible/roles/lablabs.rke2
- lablabs.rke2 (1.49.1) was installed successfully
Configuring Ansible
micro ansible.cfg (adjust the interpreter_python path to your environment):
[defaults]
inventory = inventories/prod/hosts.yml
remote_user = root
host_key_checking = False
roles_path = ./roles
forks = 10
timeout = 30
deprecation_warnings = False
stdout_callback = default
result_format = yaml
interpreter_python = /usr/bin/python3
Writing the inventory
micro inventories/prod/hosts.yml :
all:
  children:
    rke2_servers:
      hosts:
        rke2-server1:
        rke2-server2:
        rke2-server3:
    rke2_agents:
      hosts:
        rke2-agent1:
        rke2-agent2:
    rke2_cluster:
      children:
        rke2_servers:
        rke2_agents:
Because SSH Config was set up earlier, the host aliases can be used directly here, with no need to set ansible_host.
Connectivity check
micro playbooks/ping.yml :
- name: Ping all hosts
  hosts: all
  gather_facts: false
  tasks:
    - name: Ping
      ansible.builtin.ping:
Run:
ansible-playbook playbooks/ping.yml
Output:
yun@yun ~/V/a/y/p/a/rke2-ansible (master)> ansible-playbook playbooks/ping.yml
PLAY [Ping all hosts] ***********************************************************************
TASK [Ping] *********************************************************************************
ok: [rke2-agent1]
ok: [rke2-agent2]
ok: [rke2-server2]
ok: [rke2-server1]
ok: [rke2-server3]
PLAY RECAP **********************************************************************************
rke2-agent1 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-agent2 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server1 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server2 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server3 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
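ansible-playbook itself exits with a non-zero status when a host fails or is unreachable, so a live run can be checked through its exit code. For a saved log, the PLAY RECAP lines can be scanned instead. A minimal sketch; recap_is_clean is a hypothetical helper that assumes the recap format shown above:

```shell
#!/bin/sh
# Read ansible-playbook output on stdin; succeed only if at least one
# PLAY RECAP line is present and every one reports unreachable=0 and failed=0.
recap_is_clean() {
  awk '
    /unreachable=/ && /failed=/ {
      seen = 1
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^(unreachable|failed)=/) {
          split($i, kv, "=")
          if (kv[2] + 0 > 0) bad = 1
        }
      }
    }
    END { exit (seen && !bad) ? 0 : 1 }
  '
}

# Example with one recap line copied from the output above.
printf '%s\n' 'rke2-server1 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0' \
  | recap_is_clean && echo "recap is clean"
```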
Setting hostnames in bulk
A hostname must not contain an underscore (_).
micro playbooks/set-hostname.yml :
- name: Set hostname from SSH alias
  hosts: all
  become: true
  vars:
    raw_hostname: "{{ inventory_hostname | lower }}"
    hostname_from_alias: "{{ raw_hostname | regex_replace('[^a-z0-9-]', '') | regex_replace('^-+', '') | regex_replace('-+$', '') }}"
  tasks:
    - name: Ensure hostname is not empty
      ansible.builtin.assert:
        that:
          - hostname_from_alias | length > 0
        fail_msg: "Derived hostname is empty. Check inventory_hostname: {{ inventory_hostname }}"
    - name: Set hostname
      ansible.builtin.hostname:
        name: "{{ hostname_from_alias }}"
Run:
ansible-playbook playbooks/set-hostname.yml
Result:
yun@yun ~/V/a/y/p/a/rke2-ansible (master)> ansible-playbook playbooks/set-hostname.yml
PLAY [Set hostname from SSH alias] **********************************************************
TASK [Gathering Facts] **********************************************************************
ok: [rke2-server3]
ok: [rke2-server2]
ok: [rke2-server1]
ok: [rke2-agent2]
ok: [rke2-agent1]
TASK [Ensure hostname is not empty] *********************************************************
ok: [rke2-server1] => {
"changed": false,
"msg": "All assertions passed"
}
ok: [rke2-server2] => {
"changed": false,
"msg": "All assertions passed"
}
ok: [rke2-server3] => {
"changed": false,
"msg": "All assertions passed"
}
ok: [rke2-agent1] => {
"changed": false,
"msg": "All assertions passed"
}
ok: [rke2-agent2] => {
"changed": false,
"msg": "All assertions passed"
}
TASK [Set hostname] *************************************************************************
changed: [rke2-agent1]
changed: [rke2-server1]
changed: [rke2-server3]
changed: [rke2-server2]
changed: [rke2-agent2]
PLAY RECAP **********************************************************************************
rke2-agent1 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-agent2 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server1 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server2 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server3 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
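To preview what the playbook's filter chain will produce for a given alias before touching any server, the same three steps (lowercase, drop disallowed characters, trim leading and trailing hyphens) can be reproduced locally. A minimal sketch; sanitize_hostname is an illustrative name, not something Ansible provides:

```shell
#!/bin/sh
# Lowercase, drop everything except a-z, 0-9 and "-", then trim
# leading and trailing hyphens.
sanitize_hostname() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cd 'a-z0-9-' \
    | sed -e 's/^-*//' -e 's/-*$//'
}

# The underscore is removed, not replaced: this yields rke2server1.
echo "$(sanitize_hostname 'RKE2_Server1')"
```

Because disallowed characters are deleted rather than substituted, two different aliases can collapse to the same hostname (rke2_a1 and rke2a1, for example), which would violate the unique node name requirement.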
Disabling SSH password login (optional)
Before running this, confirm that key-based login works so that you do not lock yourself out of the servers.
micro playbooks/disable-ssh-password.yml :
- name: Disable SSH password authentication
  hosts: all
  become: true
  tasks:
    - name: Write SSH hardening config
      ansible.builtin.copy:
        dest: /etc/ssh/sshd_config.d/99-disable-password.conf
        mode: "0644"
        content: |
          PasswordAuthentication no
          KbdInteractiveAuthentication no
          ChallengeResponseAuthentication no
      notify: Restart sshd
    - name: Validate sshd config
      ansible.builtin.command: sshd -t
      changed_when: false
  handlers:
    - name: Restart sshd
      ansible.builtin.service:
        name: sshd
        state: restarted
Run:
ansible-playbook playbooks/disable-ssh-password.yml
Output:
yun@yun ~/V/a/y/p/a/rke2-ansible (master)> ansible-playbook playbooks/disable-ssh-password.yml
PLAY [Disable SSH password authentication] **************************************************
TASK [Gathering Facts] **********************************************************************
ok: [rke2-agent1]
ok: [rke2-server3]
ok: [rke2-agent2]
ok: [rke2-server1]
ok: [rke2-server2]
TASK [Write SSH hardening config] ***********************************************************
changed: [rke2-server3]
changed: [rke2-agent1]
changed: [rke2-server2]
changed: [rke2-server1]
changed: [rke2-agent2]
TASK [Validate sshd config] *****************************************************************
ok: [rke2-server3]
ok: [rke2-agent1]
ok: [rke2-server2]
ok: [rke2-agent2]
ok: [rke2-server1]
RUNNING HANDLER [Restart sshd] **************************************************************
changed: [rke2-server2]
changed: [rke2-server3]
changed: [rke2-server1]
changed: [rke2-agent2]
changed: [rke2-agent1]
PLAY RECAP **********************************************************************************
rke2-agent1 : ok=4 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-agent2 : ok=4 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server1 : ok=4 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server2 : ok=4 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server3 : ok=4 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
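sshd uses the first value it obtains for each keyword, so a drop-in file that sorts before 99-disable-password.conf could still leave password login enabled. One way to confirm the effective setting is to inspect the output of sshd -T, which prints the resolved configuration with lowercase keywords. A minimal sketch; password_login_disabled is a hypothetical helper, and the exact keyword names are assumed to match the OpenSSH version shipped with Rocky Linux 9:

```shell
#!/bin/sh
# Read `sshd -T` output on stdin; succeed only if both password-style
# authentication methods are reported as disabled.
password_login_disabled() {
  awk '
    tolower($1) == "passwordauthentication"       { pw  = tolower($2) }
    tolower($1) == "kbdinteractiveauthentication" { kbd = tolower($2) }
    END { exit (pw == "no" && kbd == "no") ? 0 : 1 }
  '
}

# Example with canned input; on a real host, pipe in `sshd -T` instead.
printf 'passwordauthentication no\nkbdinteractiveauthentication no\n' \
  | password_login_disabled && echo "password login is disabled"
```

Against a live server this could be run as ssh rke2-server1 sshd -T | password_login_disabled.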
Updating system packages in bulk and rebooting (recommended)
This applies to hosts that already run Rocky Linux 9 and only need their system packages updated. If no reboot is needed, set reboot_after_update to false.
micro playbooks/update-packages.yml :
- name: Update Rocky Linux packages
  hosts: all
  become: true
  serial: 1
  vars:
    reboot_after_update: true
  tasks:
    - name: Update package metadata
      ansible.builtin.dnf:
        update_cache: true
    - name: Upgrade all packages
      ansible.builtin.dnf:
        name: "*"
        state: latest
    - name: Remove unneeded packages
      ansible.builtin.dnf:
        autoremove: true
    - name: Clean package cache
      ansible.builtin.command: dnf clean all
      changed_when: false
    - name: Reboot after update (optional)
      ansible.builtin.reboot:
        reboot_timeout: 3600
      when: reboot_after_update
Run:
ansible-playbook playbooks/update-packages.yml
Output:
yun@yun ~/V/a/y/p/a/rke2-ansible (master)> ansible-playbook playbooks/update-packages.yml
PLAY [Update Rocky Linux packages] **********************************************************
TASK [Gathering Facts] **********************************************************************
ok: [rke2-server1]
TASK [Update package metadata] **************************************************************
ok: [rke2-server1]
TASK [Upgrade all packages] *****************************************************************
ok: [rke2-server1]
TASK [Remove unneeded packages] *************************************************************
ok: [rke2-server1]
TASK [Clean package cache] ******************************************************************
ok: [rke2-server1]
TASK [Reboot after update (optional)] *******************************************************
changed: [rke2-server1]
PLAY [Update Rocky Linux packages] **********************************************************
TASK [Gathering Facts] **********************************************************************
ok: [rke2-server2]
TASK [Update package metadata] **************************************************************
ok: [rke2-server2]
TASK [Upgrade all packages] *****************************************************************
changed: [rke2-server2]
TASK [Remove unneeded packages] *************************************************************
ok: [rke2-server2]
TASK [Clean package cache] ******************************************************************
ok: [rke2-server2]
TASK [Reboot after update (optional)] *******************************************************
changed: [rke2-server2]
PLAY [Update Rocky Linux packages] **********************************************************
TASK [Gathering Facts] **********************************************************************
ok: [rke2-server3]
TASK [Update package metadata] **************************************************************
ok: [rke2-server3]
TASK [Upgrade all packages] *****************************************************************
changed: [rke2-server3]
TASK [Remove unneeded packages] *************************************************************
ok: [rke2-server3]
TASK [Clean package cache] ******************************************************************
ok: [rke2-server3]
TASK [Reboot after update (optional)] *******************************************************
changed: [rke2-server3]
PLAY [Update Rocky Linux packages] **********************************************************
TASK [Gathering Facts] **********************************************************************
ok: [rke2-agent1]
TASK [Update package metadata] **************************************************************
ok: [rke2-agent1]
TASK [Upgrade all packages] *****************************************************************
changed: [rke2-agent1]
TASK [Remove unneeded packages] *************************************************************
ok: [rke2-agent1]
TASK [Clean package cache] ******************************************************************
ok: [rke2-agent1]
TASK [Reboot after update (optional)] *******************************************************
changed: [rke2-agent1]
PLAY [Update Rocky Linux packages] **********************************************************
TASK [Gathering Facts] **********************************************************************
ok: [rke2-agent2]
TASK [Update package metadata] **************************************************************
ok: [rke2-agent2]
TASK [Upgrade all packages] *****************************************************************
changed: [rke2-agent2]
TASK [Remove unneeded packages] *************************************************************
ok: [rke2-agent2]
TASK [Clean package cache] ******************************************************************
ok: [rke2-agent2]
TASK [Reboot after update (optional)] *******************************************************
changed: [rke2-agent2]
PLAY RECAP **********************************************************************************
rke2-agent1 : ok=6 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-agent2 : ok=6 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server1 : ok=6 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server2 : ok=6 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server3 : ok=6 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
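Deploying RKE2
Writing the RKE2 variables
In lablabs.rke2, rke2_config is a template path (default templates/config.yaml.j2); do not set it to a dictionary. Parameters that need to be written into config.yaml go in rke2_server_options / rke2_agent_options.
micro group_vars/all.yml :
rke2_cluster_group_name: "rke2_cluster"
rke2_servers_group_name: "rke2_servers"
rke2_agents_group_name: "rke2_agents"
rke2_channel: "latest"
rke2_version: "v1.34.2+rke2r1"
rke2_token: "CHANGE_ME"
rke2_api_ip: "<LB or server1 address>"
rke2_additional_sans:
  - "<LB or server1 address>"
rke2_selinux: true
rke2_cni:
  - cilium
Notes on these variables:
- rke2_token is the shared secret used to register nodes with the cluster and must be identical on every node. It can be generated with openssl rand -base64 32.
- rke2_api_ip is the control plane entry address. With an LB/VIP, use the IP or domain name of the LB/VIP. Without one, when each machine has a single fixed IP, you can use the IP or domain name of the first control plane node (for example rke2-server1) and add the same value to rke2_additional_sans. That setup pins the API to a single node, so the control plane entry point is not highly available; an LB/VIP is recommended for production.
- Because Rocky Linux enables SELinux by default, be sure to set rke2_selinux: true and make sure container-selinux is installed.
- To use Cilium, set rke2_cni to cilium.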
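The token step can be wrapped in a small helper so that CHANGE_ME never ends up in a real deployment. A minimal sketch; generate_rke2_token is an illustrative name, and the /dev/urandom fallback is an assumption for machines without openssl:

```shell
#!/bin/sh
# Print a random 32-byte secret, base64-encoded (44 characters).
# Uses openssl when available, otherwise /dev/urandom with base64.
generate_rke2_token() {
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -base64 32
  else
    head -c 32 /dev/urandom | base64
  fi
}

token="$(generate_rke2_token)"
echo "generated a token of ${#token} characters"
```

Since group_vars/all.yml lives in git, it may be worth encrypting the value (for example with ansible-vault encrypt_string) rather than committing it in plain text.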
micro group_vars/rke2_servers.yml :
# Each entry is an "option: value" string, as in the role's README examples.
rke2_server_options:
  - 'write-kubeconfig-mode: "0644"'
micro group_vars/rke2_agents.yml :
rke2_agent_options:
  - "node-ip: {{ ansible_default_ipv4.address }}"
Mark the first control plane node as the init node, micro host_vars/rke2-server1.yml :
rke2_server_options:
  - 'write-kubeconfig-mode: "0644"'
  - "cluster-init: true"
Writing the playbook
micro playbooks/site.yml :
- name: Base setup
  hosts: all
  become: true
  tasks:
    - name: Install base packages
      ansible.builtin.package:
        name:
          - curl
          - tar
          - socat
          - conntrack
          - iptables
          - container-selinux
        state: present
    - name: Disable swap
      ansible.builtin.command: swapoff -a
      when: ansible_swaptotal_mb | int > 0
      changed_when: false
    - name: Remove swap from fstab
      ansible.builtin.replace:
        path: /etc/fstab
        # Single-quoted YAML keeps backslashes literal, so \s reaches the regex as-is.
        # [^#] skips lines that are already commented out.
        regexp: '^([^#].*\sswap\s.*)$'
        replace: '# \1'
    - name: Load br_netfilter
      # modprobe is provided by the community.general collection.
      community.general.modprobe:
        name: br_netfilter
        state: present
    - name: Enable sysctl for Kubernetes
      # sysctl is provided by the ansible.posix collection.
      ansible.posix.sysctl:
        name: "{{ item.name }}"
        value: "{{ item.value }}"
        state: present
        reload: true
      loop:
        - { name: net.bridge.bridge-nf-call-iptables, value: 1 }
        - { name: net.bridge.bridge-nf-call-ip6tables, value: 1 }
        - { name: net.ipv4.ip_forward, value: 1 }
- name: RKE2 servers
  hosts: rke2_servers
  become: true
  serial: 1
  roles:
    - role: lablabs.rke2
- name: RKE2 agents
  hosts: rke2_agents
  become: true
  roles:
    - role: lablabs.rke2
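The "Remove swap from fstab" task is meant to comment out active swap entries in /etc/fstab. To check what such a rule does before running it against real hosts, the same transformation can be tried locally with sed. A minimal sketch; comment_out_swap is a hypothetical helper and the fstab lines are made-up samples:

```shell
#!/bin/sh
# Read fstab content on stdin and comment out active swap entries.
# Lines that are already commented are left untouched.
comment_out_swap() {
  sed -E 's/^([^#].*[[:space:]]swap[[:space:]].*)$/# \1/'
}

# Sample input: only the second line is a swap entry.
comment_out_swap <<'EOF'
UUID=aaaa / xfs defaults 0 0
/dev/mapper/rl-swap none swap defaults 0 0
EOF
```

Running the function twice over the same content gives the same result, which is the idempotent behaviour a playbook task should have.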
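Deployment and verification
Running the deployment
Start with a syntax check:
ansible-playbook playbooks/site.yml --syntax-check
Run the deployment:
ansible-playbook playbooks/site.yml
Getting the kubeconfig
Log in to any control plane node, export the kubeconfig, and use the kubectl binary that RKE2 installs:
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl get nodes -o wide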
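Rather than reading the node table by eye, the check can be scripted. A minimal sketch that reads the output of kubectl get nodes --no-headers on stdin; all_nodes_ready is a hypothetical helper, and the expected node count is passed as an argument:

```shell
#!/bin/sh
# Read `kubectl get nodes --no-headers` output on stdin; succeed only if
# exactly $1 nodes are listed and each has a STATUS starting with "Ready".
all_nodes_ready() {
  awk -v want="$1" '
    { total++ }
    $2 ~ /^Ready/ { ready++ }
    END { exit (total == want && ready == want) ? 0 : 1 }
  '
}

# Example with canned input; the rows are illustrative, not real output.
printf 'node-a Ready control-plane 5m v1\nnode-b Ready <none> 4m v1\n' \
  | all_nodes_ready 2 && echo "all nodes are Ready"
```

For the five-node layout in this article, the call would be kubectl get nodes --no-headers | all_nodes_ready 5.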
To use kubectl locally, copy the kubeconfig and point it at the API address:
mkdir -p ~/.kube
scp rke2-server1:/etc/rancher/rke2/rke2.yaml ~/.kube/rke2.yaml
sed -i 's/127\.0\.0\.1/<LB or server1 address>/g' ~/.kube/rke2.yaml
export KUBECONFIG=~/.kube/rke2.yaml
kubectl get nodes -o wide
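The address substitution can be made narrower so that it only touches the API server URL. A minimal sketch, assuming the copied kubeconfig contains the loopback URL https://127.0.0.1:6443; rewrite_api_server is a hypothetical helper and 192.0.2.10 is a placeholder address:

```shell
#!/bin/sh
# Read a kubeconfig on stdin and point its loopback API server URL at the
# address given as $1, leaving any other 127.0.0.1 occurrence untouched.
# The address must not contain "#" or "&", which are special to sed here.
rewrite_api_server() {
  sed "s#https://127\.0\.0\.1:6443#https://$1:6443#"
}

# Example with a single kubeconfig line.
printf '    server: https://127.0.0.1:6443\n' | rewrite_api_server 192.0.2.10
```

The address passed in should be one that the API server certificate covers, which is what rke2_additional_sans is for.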
At this point, the minimal highly available RKE2 cluster is deployed.